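The Pokemon dataset contains information on a total of 801 Pokemon.
It includes the variables names, percentage_male, height_m, weight_kg, hp, attack, defense, speed, base_egg_steps, base_happiness, capture_rate, experience_growth, sp_attack, sp_defense, type1, generation and is_legendary.
The data were downloaded from: https://www.kaggle.com/rounakbanik/pokemon/data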
D <- read.csv("Pokemon.csv", header = TRUE, row.names = 1) # first column of the file is used as row names
library(DT)
datatable(D, rownames = 1, filter = "top", options = list(pageLength = 5, scrollX = TRUE))
##         names     percentage_male     height_m         weight_kg
##  Abomasnow :  1   Min.   :  0.00   Min.   : 0.100   Min.   :  0.10
##  Abra      :  1   1st Qu.: 50.00   1st Qu.: 0.600   1st Qu.:  9.00
##  Absol     :  1   Median : 50.00   Median : 1.000   Median : 27.30
##  Accelgor  :  1   Mean   : 55.16   Mean   : 1.164   Mean   : 61.38
##  Aegislash :  1   3rd Qu.: 50.00   3rd Qu.: 1.500   3rd Qu.: 64.80
##  Aerodactyl:  1   Max.   :100.00   Max.   :14.500   Max.   :999.90
##  (Other)   :795   NA's   :98       NA's   :20       NA's   :20
##        hp             attack          defense           speed
##  Min.   :  1.00   Min.   :  5.00   Min.   :  5.00   Min.   :  5.00
##  1st Qu.: 50.00   1st Qu.: 55.00   1st Qu.: 50.00   1st Qu.: 45.00
##  Median : 65.00   Median : 75.00   Median : 70.00   Median : 65.00
##  Mean   : 68.96   Mean   : 77.86   Mean   : 73.01   Mean   : 66.33
##  3rd Qu.: 80.00   3rd Qu.:100.00   3rd Qu.: 90.00   3rd Qu.: 85.00
##  Max.   :255.00   Max.   :185.00   Max.   :230.00   Max.   :180.00
##
##  base_egg_steps  base_happiness    capture_rate    experience_growth
##  Min.   : 1280   Min.   :  0.00   Min.   :  3.00   Min.   : 600000
##  1st Qu.: 5120   1st Qu.: 70.00   1st Qu.: 45.00   1st Qu.:1000000
##  Median : 5120   Median : 70.00   Median : 60.00   Median :1000000
##  Mean   : 7191   Mean   : 65.36   Mean   : 98.76   Mean   :1054996
##  3rd Qu.: 6400   3rd Qu.: 70.00   3rd Qu.:170.00   3rd Qu.:1059860
##  Max.   :30720   Max.   :140.00   Max.   :255.00   Max.   :1640000
##                                   NA's   :1
##    sp_attack        sp_defense         type1       generation
##  Min.   : 10.00   Min.   : 20.00   water  :114   Min.   :1.00
##  1st Qu.: 45.00   1st Qu.: 50.00   normal :105   1st Qu.:2.00
##  Median : 65.00   Median : 66.00   grass  : 78   Median :4.00
##  Mean   : 71.31   Mean   : 70.91   bug    : 72   Mean   :3.69
##  3rd Qu.: 91.00   3rd Qu.: 90.00   psychic: 53   3rd Qu.:5.00
##  Max.   :194.00   Max.   :230.00   fire   : 52   Max.   :7.00
##                                    (Other):327
##   is_legendary
##  Min.   :0.00000
##  1st Qu.:0.00000
##  Median :0.00000
##  Mean   :0.08739
##  3rd Qu.:0.00000
##  Max.   :1.00000
##
NbClust will be used to determine the optimal number of clusters.
library(missMDA)
impData <- imputePCA(D[, 3:14], ncp = 2, scale = TRUE, method = c("Regularized", "EM")) # impute missing data in height_m to sp_defense
ImputData <- data.frame(cbind(impData[["completeObs"]], D[, 16:17])) # add generation and is_legendary
X <- scale(ImputData[, 1:12]) # standardized characteristics of each Pokemon
library(NbClust)
res_nbclust<-NbClust(X,min.nc = 2, max.nc = 20, index="silhouette",method = "kmeans")
res_nbclust$All.index
##      2      3      4      5      6      7      8      9     10     11
## 0.2238 0.2342 0.2334 0.2011 0.1504 0.1480 0.1147 0.1238 0.1206 0.1339
##     12     13     14     15     16     17     18     19     20
## 0.1273 0.1380 0.1335 0.1258 0.1206 0.1282 0.1303 0.1269 0.1289
In the following, 2 clusters are used, with a silhouette index equal to 0.2238. Note that the index is slightly higher for 3 clusters (0.2342) and 4 clusters (0.2334).
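As an illustration of what the silhouette criterion measures, here is a minimal base R sketch that computes the average silhouette width for several values of k and keeps the largest. It runs on simulated toy data, not on the Pokemon data. The helper avg_silhouette, the object names and the seed are assumptions made for this example, and the implementation inside NbClust may differ in its details.

```r
# Average silhouette width of a partition `cl` of the rows of `x` (hypothetical helper).
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))              # pairwise Euclidean distances
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cl == cl[i]
    if (sum(own) == 1) next            # singleton cluster: silhouette is 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)            # mean distance within own cluster
    b <- min(tapply(d[i, !own], cl[!own], mean))    # mean distance to the nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(1)                            # arbitrary seed, only for reproducibility of the sketch
# Toy data: two well-separated groups of 50 points in 2 dimensions
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  avg_silhouette(toy, kmeans(toy, centers = k, nstart = 10)$cluster)
})
names(avg_sil) <- ks
best_k <- ks[which.max(avg_sil)]       # k with the largest average silhouette width
print(round(avg_sil, 4))
print(best_k)
```

On the toy data the two simulated groups are recovered, so best_k is 2. Applied to the values printed above for the Pokemon data, the same rule would point to 3 clusters, with 2 and 4 close behind.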
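library(factoextra) # provides fviz_cluster()
library(fossil) # provides rand.index()
km <- kmeans(X, 2) # k-means with 2 clusters
f <- fviz_cluster(list(data = X, cluster = km$cluster), geom = "point", stand = FALSE, palette = "jco")
f # display the cluster plot
class <- km$cluster # cluster membership of each Pokemon
ImputData <- cbind(ImputData, class)
d <- data.frame(km$centers) # cluster centers
rand.index(ImputData$is_legendary, class)
## [1] 0.5135955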
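To make the Rand index easier to interpret, the following base R sketch computes it by hand: it is the share of pairs of observations on which two partitions agree, meaning the pair is either grouped together in both or separated in both. The function name rand_index_sketch and the small label vectors are hypothetical and chosen only for illustration. This is not the code used by the fossil package.

```r
# Rand index of two label vectors, computed over all unordered pairs (hypothetical helper).
rand_index_sketch <- function(a, b) {
  stopifnot(length(a) == length(b))
  pairs <- combn(length(a), 2)                  # each column is one pair (i, j)
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]      # pair grouped together in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]      # pair grouped together in partition b?
  mean(same_a == same_b)                        # share of pairs where both partitions agree
}

# Same partition with swapped labels: perfect agreement
r_same <- rand_index_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))
# Crossed partitions: only 2 of the 6 pairs agree
r_cross <- rand_index_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
print(c(r_same, r_cross))
```

Read this way, the value 0.5135955 obtained above means that the k-means classes and is_legendary agree on roughly 51% of the pairs of Pokemon.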
The variables from height_m to sp_defense are the quantitative variables for the PCA.
The supplementary qualitative variables are generation and is_legendary.
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.6406574              38.672145                          38.67214
## comp 2  1.3770799              11.475666                          50.14781
## comp 3  1.2355183              10.295986                          60.44380
## comp 4  0.8768248               7.306873                          67.75067
In our case, we keep the first 3 dimensions: each has an eigenvalue greater than 1, and together they account for 60.44% of the variance.
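The arithmetic behind this choice can be checked with a short sketch that starts from the eigenvalues printed above. It assumes that the PCA was run on the 12 standardized quantitative variables, so that the total variance equals 12. This assumption is consistent with the printed percentages, but the PCA call itself is not shown. The object names are illustrative.

```r
# Eigenvalues of the first four components, copied from the output above
eig <- c(4.6406574, 1.3770799, 1.2355183, 0.8768248)
p <- 12                          # assumed number of standardized variables (total variance)

pct <- 100 * eig / p             # percentage of variance per component
cum_pct <- cumsum(pct)           # cumulative percentage of variance
n_keep <- sum(eig > 1)           # components with an eigenvalue above 1

print(round(pct, 2))
print(n_keep)
print(round(cum_pct[n_keep], 2))
```

Under that assumption the sketch keeps 3 components, which together carry 60.44% of the variance.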
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)